ADDRESS AGGREGATION SYSTEM AND METHOD FOR INCREASING 
THROUGHPUT OF ADDRESSES TO A DATA CACHE FROM A PROCESSOR 

FIELD OF THE INVENTION 

The present invention generally relates to computer 
processor architectures, and more particularly, to an 
address aggregation system and method for increasing 
throughput of addresses to a data cache from a processor 
that executes instructions out of order, to thereby enhance 
performance. 



BACKGROUND OF THE INVENTION 
A computer processor (processing unit) generally 
comprises a control unit, which directs the operation of 
the system, and an arithmetic logic unit (ALU), which 
performs computational operations. The design of a 
processor involves the selection of a register set(s), 
communication passages between these registers, and a means 
of directing and controlling how these operate. Normally, 
a processor is directed by a program, which consists of a 
series of instructions that are kept in a main memory. 
Each instruction is a group of bits, usually one or more 
words in length, specifying an operation to be carried out 
by the processor. In general, the basic cycle of a 
processor comprises the following steps: (a) fetch an 
instruction from main memory into an instruction register; 
(b) decode the instruction (i.e., determine what it 
indicates should be done; each instruction indicates an 
operation to be performed and the data to which the 
operation should be applied); (c) carry out the operation 
specified by the instruction; and (d) determine where the 
next instruction is located. Normally, the next 
instruction is the one immediately following the current 
one. 



However, in high performance processors, such as 
superscalar processors where two or more scalar operations 
are performed in parallel, the processor may be designed to 
perform instructions that are out of order, or in an order 
that is not consistent with that defined by the software 
driving the processor. In these systems, instructions are 
executed when they can be executed, as opposed to when they 
appear in the sequence defined by the program. Moreover, 
after execution of out of order instructions, the results 
are ultimately reordered to correspond with the instruction 
order. 

A cache memory is often employed in association with 
a processor in a computer in order to optimize performance. 
A cache memory is a fast buffer located between the 
processor and the main memory of the computer. Data and 
instructions in current use in the processor are moved into 
the cache memory, thereby producing two benefits. First, 
the average access time for the processor's memory requests 
is reduced, increasing the processor's throughput. 
Second, the processor's utilization of the available memory 
bandwidth is thereby reduced, allowing other devices on the 
system bus to use the memory without interfering with the 
processor. A cache memory is thus used to speed up the 
flow of instructions and data into the processor from the 
main memory. This cache function is important because the 
main memory cycle time is typically slower than processor 
clocking rates. 

When a processor accesses a data cache for a data 
line, the processor forwards an address to the cache. The 
cache parses a cache index from the address and uses it to 
select a storage location(s) that may contain the desired 
data line. The cache outputs a tag, which is a real page 
number (RPN) in some designs, corresponding with the 
location(s) and a status indicator, which indicates whether 
the data line corresponding with the tag is valid or 
invalid. 

Support circuitry, typically associated with the 
cache, receives the status indicator and the tag. When the 
status indicator indicates invalid data, then the support 
circuitry forwards a "miss" indication to the processor, in 
which case the processor must access the main memory for 
the data line. When the status indicator indicates valid 
data, the support circuitry compares the tag with the 
remainder of the address in order to determine if the cache 
is currently storing the desired data line. When the cache 
does not have the data line being requested as determined 
by the tag comparison, then the support circuitry forwards 
a "miss" indication to the processor, in which case the 
processor must access the main memory for the data line. 
When the cache does have the data line being requested as 
determined by the tag comparison, then the support 
circuitry forwards a "hit" indication to the processor, 
which prompts the processor to read the requested data 
line. 
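
By way of illustration only, the lookup sequence described above can be sketched in Python. The sketch assumes a direct-mapped cache; the field widths (INDEX_BITS, OFFSET_BITS) and the names Line, split_address, and lookup are hypothetical and are not taken from any particular cache design.

```python
from dataclasses import dataclass

INDEX_BITS = 4    # assumed: 16 storage locations
OFFSET_BITS = 5   # assumed: 32-byte data lines


@dataclass
class Line:
    valid: bool = False   # status indicator (valid or invalid)
    tag: int = 0          # remainder of the address above the index
    data: bytes = b""


def split_address(addr):
    """Return (tag, index); the byte offset within the line is dropped."""
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index


def lookup(cache, addr):
    """Return ("hit", data) or ("miss", None)."""
    tag, index = split_address(addr)
    line = cache[index]
    if not line.valid:     # status indicator reports invalid data
        return "miss", None
    if line.tag != tag:    # tag comparison fails
        return "miss", None
    return "hit", line.data


cache = [Line() for _ in range(1 << INDEX_BITS)]
tag, index = split_address(0x1234)
cache[index] = Line(valid=True, tag=tag, data=b"payload")
```

In this sketch a miss is reported either because the selected location is invalid or because the stored tag differs from the tag parsed from the address, mirroring the two miss cases described above.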

In processors that perform out of order execution of 
instructions, it is desirable to make multiple simultaneous 
accesses to the data cache to enhance throughput from the 
processor to the cache memory and overall speed of the 
processor. It would be possible to utilize a cache memory 
having multiple ports, one corresponding with each access 
to the cache memory. However, this solution is undesirable 
as these cache designs are costly and not suitable for mass 
production of inexpensive processors and computers 
implementing large off-chip caches. 
An object of the present invention is to overcome the 
inadequacies and deficiencies of the prior art as discussed 
above in the background section. 

Another object of the present invention is to improve 
the performance of processors that execute instructions out 
of order. 

Another object of the present invention is to provide 
a system and method for inexpensively implementing multiple 
accesses to a data cache (DCACHE) associated with a 
processor of a computer. 

Another object of the present invention is to provide 
a system and method for increasing the efficiency of 
addressing of a DCACHE by a processor and data transfers 
from the DCACHE to the processor. 

Another object of the present invention is to provide 
a system and method for increasing throughput of data from 
a DCACHE to an associated processor, while ensuring 
reliability. 

Briefly described, the present invention provides for 
an address aggregation system that enhances the performance 
of a processor that executes instructions out of order by 
maximizing the usage of read ports of a DCACHE associated 
with the processor. In essence, the processor is 
configured to forward a plurality of addresses generated by 
instructions in an instruction reordering mechanism, for 
example, a memory queue (MQUEUE), to respective cache banks 
made from corresponding single ported storage devices, such 
as a random access memory (RAM). In the preferred 
embodiment, an odd memory address and an even memory 
address are concurrently forwarded to the DCACHE during 
each cycle. 

In architecture, the processor comprises an 
instruction cache (ICACHE), an instruction fetch mechanism 
(IFETCH) for retrieving instructions from the ICACHE, a 
sort mechanism for receiving instructions from the IFETCH 
and for sorting the instructions into arithmetic 
instructions and memory instructions, and a reordering 
mechanism, such as the MQUEUE, for receiving the memory 
instructions from the sort mechanism and permitting the 
instructions to execute out of order. The MQUEUE includes 
a plurality of address reorder buffer slots (ARBSLOTs), an 
odd bank arbitrator, and an even bank arbitrator. Each of 
the ARBSLOTs maintains an address, determines whether the 
address is odd or even, and generates a 
respective odd or even request depending upon whether the 
address is either odd or even. The odd and even bank 
arbitrators receive the requests associated with the odd 
and even addresses respectively and control the slots to 
output addresses to the cache. 

The invention can also be viewed as providing a novel 
method for processing data addresses in a processor and 
increasing throughput of the data addresses to a data cache 
from the processor. The method, as broadly conceptualized, 
comprises the following steps: maintaining a plurality of 
independent banks in the cache; collecting data addresses 
in the processor; allocating each of the data addresses to 
a particular one of the banks; and communicating an address 
to each of the banks during a single cycle of the 
processor. 
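
As a non-limiting sketch, the steps of the method may be modeled in Python as follows. The two-bank arrangement follows the preferred embodiment; the line size LINE_BYTES and the names bank_of and schedule are assumptions introduced only for illustration.

```python
from collections import deque

NUM_BANKS = 2     # odd and even banks, as in the preferred embodiment
LINE_BYTES = 32   # assumed line size; not specified in the text


def bank_of(addr):
    """Allocate an address to a bank using the line-granular odd/even bit."""
    return (addr // LINE_BYTES) % NUM_BANKS


def schedule(addresses):
    """Return one dict per cycle mapping bank -> address (at most one per bank)."""
    queues = [deque() for _ in range(NUM_BANKS)]
    for addr in addresses:            # collect and allocate the addresses
        queues[bank_of(addr)].append(addr)
    cycles = []
    while any(queues):                # an empty deque is falsy
        cycles.append({bank: q.popleft() for bank, q in enumerate(queues) if q})
    return cycles
```

When the collected addresses alternate between banks, two addresses are communicated per cycle; when they all fall in one bank, only one address per cycle can be communicated.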

Other objects, features, and advantages of the present 
invention will become apparent to one of skill in the art 
upon examination of the following drawings and detailed 
description. It is intended that all such additional 
objects, features, and advantages be included herein within 
the scope of the present invention, as defined by the 
claims. 



BRIEF DESCRIPTION OF THE DRAWINGS 

The invention can be better understood with reference 
to the following drawings. In the drawings, the schematic 
illustrations of the various components therein are not 
necessarily to scale, emphasis instead being placed upon 
clearly illustrating principles of the invention. 
Furthermore, like reference numerals designate 
corresponding parts throughout the several views. 

Fig. 1 is a block diagram showing a computer 
implementing the address aggregation system of the present 
invention; 

Fig. 2 is a block diagram showing a possible 
implementation of an instruction fetch/execution system in 
a processor of Fig. 1 and its relationship to a data cache 
(DCACHE) associated with the processor of Fig. 1; 

Fig. 3 is a block diagram showing a possible 
implementation of the novel address aggregation system of 
Fig. 1; 

Fig. 4 is a schematic diagram showing a possible 
implementation of logic in each address reorder buffer slot 
(ARBSLOT) of Fig. 3 in order to sort addresses into odd and 
even sets; and 

Figs. 5A-5D illustrate block diagrams showing a 
possible implementation of the arbitrators of Fig. 3; more 
specifically, Fig. 5A is a high level block diagram of the 
overall architecture; Fig. 5B is a block diagram of the 
oldest logic of Fig. 5A; Fig. 5C is a block diagram of the 
low done logic of Fig. 5A; and Fig. 5D is a block diagram 
of the grant decision logic of Fig. 5A. 



DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT 
As shown in Fig. 1, the address aggregation system 80 
(see Fig. 3 for more details) and associated methodology of 
the present invention are implemented within a computer 11, 
and particularly, in connection with a memory queue 
(MQUEUE) 38b of an instruction fetch/execution system 12 
within a processor 14 and in connection with a data cache 
(DCACHE) 24 connected to the processor 14 of the computer 
11. The computer 11 generally comprises the processor 14, 
a main memory 16 having software (S/W) 18 for driving the 
processor 14, the DCACHE 24 in the form of single ported 
storage devices, such as random access memories (RAMs), 
interconnected with the processor 14 as indicated by 
reference arrow 23, and a system interface 22, such as one 
or more buses, interconnecting the processor 14 and the 
main memory 16. In operation, as the instruction 
fetch/execution system 12 in the processor 14 executes the 
software 18, data that is in current use in the processor 
14 is moved into the DCACHE 24 under the control of 
instructions in the MQUEUE 38b, thereby reducing the 
average access time for the processor's memory requests and 
minimizing traffic on the system interface 22. Finally, it 
should be mentioned that, with the exception of the novel 
address aggregation system 80, all of the aforementioned 
computer components and their interactions are well known 
and understood in the art. 

A typical cache line in the DCACHE 24 includes a tag, 
a status indicator, and data. A cache index is forwarded 
to the DCACHE 24 and is used by the DCACHE 24 to select a 
storage location(s) that may contain the desired data line. 
In response to receipt of a cache index, the DCACHE 24 
outputs a tag, which is a real page number (RPN) in the 
preferred embodiment, corresponding with the location(s), 
a status indicator, which indicates whether the data line 
corresponding with the tag is valid or invalid, and data, 
which may be valid or invalid. Typically, the status 
indicator indicates the following states: "invalid," which 
means that no data is present; "valid shared," which means 
that data is present, but may be also located elsewhere; 
"valid private clean," which means that the line has the 
sole copy and the DCACHE 24 has not yet written to the 
line; and "valid private dirty," which means that the line 
has the sole copy and that the DCACHE 24 has written to the 
line (and thus needs to copy the line to main memory 16). 
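
For illustration, the four states of the status indicator can be represented as an enumeration. The Python names below (LineState, is_valid, needs_writeback) are hypothetical; only the four state names come from the description above.

```python
from enum import Enum


class LineState(Enum):
    INVALID = "invalid"                          # no data is present
    VALID_SHARED = "valid shared"                # present, possibly elsewhere too
    VALID_PRIVATE_CLEAN = "valid private clean"  # sole copy, not yet written
    VALID_PRIVATE_DIRTY = "valid private dirty"  # sole copy, written


def is_valid(state):
    """True for any of the three valid states."""
    return state is not LineState.INVALID


def needs_writeback(state):
    """Only a dirty line must eventually be copied to main memory."""
    return state is LineState.VALID_PRIVATE_DIRTY
```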

A tag compare mechanism 228 (not shown) associated 
with the DCACHE 24, receives the status indicator and the 
tag. When the status indicator indicates invalid data, 
then the tag compare mechanism forwards a "miss" indication 
to the processor 14, in which case the processor 14 
accesses the main memory 16 for the data line. When the 
status indicator indicates valid data, the tag compare 
mechanism compares the tag with the remainder of the 
address in order to determine if the DCACHE 24 is currently 
storing the desired data line. When the DCACHE 24 does not 
have the data line being requested as determined by the tag 
comparison, then the tag compare mechanism forwards a 
"miss" indication to the processor 14, in which case the 
processor 14 accesses the main memory 16 for the data line. 
When the DCACHE 24 does have the data line being requested 
as determined by the tag comparison, then the tag compare 
mechanism forwards a "hit" indication to the processor 14, 
which prompts the processor 14 to read the requested data 
line. 

A possible implementation of the instruction 
fetch/execution system 12 is illustrated by way of block 
diagram in Fig. 2. As shown in Fig. 2, the instruction 
fetch/execution system 12 has an instruction cache (ICACHE) 
26 for storing instructions from the software 18 (Fig. 1). 
An instruction fetch mechanism (IFETCH) 28 communicates 
with the ICACHE 26 and retrieves instructions from the 
ICACHE 26 for ultimate execution. In the preferred 
embodiment, the IFETCH 28 fetches four instructions (each 
32 bits) at a time and transfers the instructions to a sort 
mechanism 32. 

The sort mechanism 32 determines whether each 
instruction is destined for an arithmetic logic unit (ALU) 
or the memory and distributes the instructions accordingly 
into an arithmetic logic unit queue (AQUEUE) 38a and the 
MQUEUE 38b, respectively, as indicated by corresponding 
reference arrows 36a, 36b. 

The AQUEUE 38a contains ALU instruction processing 
mechanisms 39a (in the preferred embodiment, there are 28 
in number) that have registers 41a for storing respective 
instructions that are directed to an arithmetic logic unit 
42, as indicated by reference arrow 43. The instructions 
in the AQUEUE 38a are executed in any order possible 
(preferably, in data flow fashion), and as they complete, 
the results are captured and marked complete. 

The ALU 42, under the control of the AQUEUE 38a, can 
retrieve operands from rename registers 44a, 44b and 
general registers 46, as is indicated by interface 45. 
After the ALU 42 operates on the operands, the results of 
the operation are stored in the AQUEUE rename registers 
44a, as delineated by reference arrow 49. 

The MQUEUE 38b contains instruction processing 
mechanisms 39b. Each instruction processing mechanism 39b 
includes a register 41b for storing a respective memory 
instruction and includes an address reorder buffer slot 
(ARBSLOT; in the preferred embodiment, there are 28 in 
number), denoted by reference numeral 48, for storing a 
respective address. Memory instructions in the MQUEUE 38b 
can be classified as "loads" and "stores" to memory. A 
"load" is a request to transfer data from memory (DCACHE 24 
or main memory 16) to a register, whereas a "store" is a 
request to transfer data from a register to memory. 



During execution of an instruction, a first phase 
involves executing a prescribed mathematical operation on 
operands in order to compute an address, and a second phase 
involves accessing the memory/cache for data based upon the 
calculated address. The MQUEUE 38b executes each of the 
instructions and the two phases (address computation and 
memory/cache access) of execution in any order possible 
(preferably, in data flow fashion). As the instructions 
complete, the results are captured by the MQUEUE rename 
registers 44b and the instruction is marked as complete in 
the MQUEUE 38b. In the preferred embodiment, the MQUEUE 
38b receives up to four instructions (32 bits each) per 
cycle from the sort mechanism 32 and transfers up to two 
instructions (32 bits) per cycle to a retire mechanism 52, 
as indicated by reference arrow 51b. 

More specifically, during the first phase of 
instruction execution, an address is generated by an 
address calculator 58. The address calculator 58 computes 
the address based upon operands retrieved from the rename 
registers 44b and passes the address (real or virtual) to 
an ARBSLOT 48 corresponding to the instruction in the 
MQUEUE 38b, as indicated by reference arrow 62. Control of 
the calculation by the instruction is indicated by the 
reference arrow 64 in Fig. 2. When the second phase of 
memory instruction execution is pursued, the calculated 
address (including a cache index) is transferred to the 
DCACHE 24, as indicated by the reference arrow 54, to 
accomplish a load or a store at the DCACHE 24. In the 
preferred embodiment, two addresses are transferred each 
cycle, if possible, from the MQUEUE 38b to the DCACHE 24. 
Once the DCACHE 24 processes the address, the data results 
are transferred to the rename registers 44b, as indicated 
by reference arrow 56. 

The retire mechanism 52 receives executed instructions 
(preferably, two 32-bit words per cycle) from each of the 
queues 38a, 38b, as indicated by reference arrows 51a, 51b. 
The retire mechanism 52 commits the instruction results to 
the architecture state. The software 18 (Fig. 1) is not 
made aware of any results that are not transformed to the 
architecture state by the retire mechanism 52. The retire 
mechanism 52 retires the instructions in the queues 38a, 
38b in the program order defined by the software 18 by 
moving the instruction results to a general register 46 
and/or a control register 72, as indicated by respective 
reference arrows 73, 74, depending upon the instruction's 
attributes, and causes the results of the instruction to be 
passed from the rename registers 44a, 44b to the general 
registers 46, as indicated by the reference arrows 76a, 
76b. 
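
The in-order retirement of instructions that completed out of order can be sketched as follows. This is a simplified model under stated assumptions: the queue is a Python list held in program order, each entry is a hypothetical dict with "name" and "done" keys, and at most two instructions retire per cycle as in the preferred embodiment.

```python
def retire_cycle(queue, max_retire=2):
    """Retire up to max_retire completed instructions from the head of the queue.

    Retirement stops at the first instruction that has not completed, so
    results reach the architecture state strictly in program order.
    """
    retired = []
    while queue and queue[0]["done"] and len(retired) < max_retire:
        retired.append(queue.pop(0)["name"])
    return retired


queue = [
    {"name": "i0", "done": True},
    {"name": "i1", "done": False},   # still executing
    {"name": "i2", "done": True},    # completed out of order
]
first = retire_cycle(queue)          # i2 must wait behind i1
queue[0]["done"] = True              # i1 completes
second = retire_cycle(queue)
```

Although i2 completes before i1 in this example, it does not retire until i1 has retired.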

When the retire mechanism 52 retires an instruction 
that resulted in a store to a data line in the DCACHE 24, 
the retire mechanism 52 forwards the data line to the 
DCACHE 24 and marks the status indicator corresponding with 
the line as "dirty," to indicate that the line has changed 
and should ultimately be forwarded to the main memory 16 
for updating the line at the main memory 16. 

The retire mechanism 52 also has logic for determining 
whether there is an exception associated with an 
instruction. An exception is a flag that indicates a 
special circumstance corresponding with one of the 
currently retiring instructions. In the event of an 
exception, the retire mechanism 52 discards all 
instructions within the queues 38a, 38b that follow the 
instruction that indicated the exception and causes the 
IFETCH 28 to retrieve once again the instructions at issue 
for re-execution or to retrieve special software to handle 
the special circumstance. 



The address aggregation system 80 of the present 
invention will now be described with reference to Fig. 3. 
In accordance with the address aggregation system, the 
processor 14 is configured to forward a plurality of 
addresses to respective cache banks in corresponding 
single-ported storage devices that form the DCACHE 24. In 
the preferred embodiment, an odd memory address and an even 
memory address are concurrently forwarded to respective odd 
and even cache banks of the DCACHE during each cycle. 

The address aggregation system 80 is implemented by 
way of resources primarily situated in the MQUEUE 38b, 
as indicated in Fig. 3. The address calculator 58 involves 
adders 82a, 82b, each of which receives two input operands 
84 (reference arrow 45 in Fig. 1) from the rename registers 
44b. The adders 82a, 82b operate upon their respective 
input operands 84 and generate addresses 62a, 62b, 
respectively. 
The MQUEUE 38b, as constructed pursuant to the 
invention, includes a plurality of ARBSLOTs 48, one for 
storing each address. There are 28 ARBSLOTs 48 in the 
preferred embodiment; however, any number could be 
employed. An odd bank arbitrator 84a and an even bank 
arbitrator 84b are both in communication with each of the 
ARBSLOTs 48, as indicated by respective arrows 86a, 86b. 
Typically, two addresses are forwarded by the MQUEUE 38b to 
the DCACHE 24 during each cycle, one being odd and the 
other being even in the preferred embodiment. The odd and 
even addresses are output from respective ARBSLOTs 48, as 
indicated by reference arrows 88a, 88b, or are output from 
bypass paths 92a, 92b, respectively. The bypass paths 92a, 
92b essentially forward the addresses on respective inputs 
62a, 62b directly to the DCACHE 24, when controlled to do 
so. The bypass paths 92a, 92b are utilized when no valid 
address (for either the odd or even cache port) is ready to 
be transferred to the DCACHE 24 so that cycles are not 
wasted and high performance is achieved. 

Each instruction in the MQUEUE 38b calculates its 
address once its dependencies have cleared. Once an 
address has been calculated, the instruction indicates this 
status and requests that the MQUEUE 38b launch the address to the 
DCACHE 24. The arbitration logic, either the odd bank 
arbitrator 84a or the even bank arbitrator 84b depending 
upon whether the address corresponding with the instruction 
is either odd or even, decides when and whether to launch 
the address to the DCACHE 24. The corresponding arbitrator 
84a, 84b selects the oldest address (either odd or even) 
and launches the oldest. 
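
An idealized oldest-first selection for one bank can be sketched in Python. The sketch assumes that ARBSLOTs are allocated in circular program order, so that age decreases with circular distance from the oldest slot; the parameter oldest_slot and the function name grant_oldest are hypothetical. The arbitrators of Figs. 5A-5D approximate this behavior with group-level logic rather than a full scan.

```python
NUM_SLOTS = 28   # number of ARBSLOTs in the preferred embodiment


def grant_oldest(requests, oldest_slot):
    """Return the index of the oldest requesting slot, or None if none request.

    requests    -- sequence of NUM_SLOTS booleans for one bank (odd or even)
    oldest_slot -- assumed index of the slot holding the oldest instruction
    """
    for step in range(NUM_SLOTS):
        slot = (oldest_slot + step) % NUM_SLOTS   # wrap around the circular queue
        if requests[slot]:
            return slot
    return None


odd_requests = [False] * NUM_SLOTS
odd_requests[3] = True
odd_requests[26] = True
```

With the oldest instruction in slot 25, slot 26 is granted ahead of slot 3, because slot 3 is reached only after the scan wraps around.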

The instructions of the MQUEUE 38b execute out of 
order as operands become available. Accordingly, addresses 
are calculated out of order and the addresses received by 
the MQUEUE 38b may be out of order. However, the order of 
the addresses that are sent from the MQUEUE 38b to the 
DCACHE 24 is prioritized by the order dictated by the 
software 18 (Fig. 1). This implementation results in a 
performance advantage because priority is given to the 
oldest instruction, and the configuration optimally 
interfaces addresses to the software 18 (Fig. 1). 

As further illustrated in Fig. 3, a multiplexer 
mechanism 93 handles the direct and bypass paths from the 
MQUEUE 38b. The multiplexer mechanism 93 includes 
multiplexers (MUX) 94a, 94b, which receive respective 
addresses 88a, 92a and 88b, 92b from the MQUEUE 38b. In 
essence, the multiplexers 94a, 94b control whether an 
address is communicated from the ARBSLOTs 48 to the DCACHE 
24, or alternatively, whether an address is communicated 
from the bypass paths 92a, 92b to the DCACHE 24. The 
multiplexers 94a, 94b are controlled by the odd or even 
arbitrator 84a, 84b, as indicated by reference arrow 96. 
The multiplexers 94a, 94b transfer a selected address to 
respective odd and even banks 98a, 98b, as indicated by 
reference arrows 99a, 99b. An odd address and an even 
address are transferred to the DCACHE 24 during a single 
cycle in the typical operation. Occasionally, only either 
an odd or an even address is available, in which case only 
the single odd or even address is transferred to the DCACHE 
24 during that particular cycle. However, the foregoing 
scenario is rare. Finally, the multiplexers 94a, 94b are 
controlled to select the bypass paths 92a, 92b when no 
ARBSLOT 48 requests the port of the DCACHE 24. 
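
The selection performed by the multiplexers can be sketched as follows. The sketch assumes that each bypass address already belongs to the bank of its multiplexer and that None denotes an idle port; the names select_address and drive_banks are hypothetical.

```python
def select_address(granted, bypass):
    """One multiplexer: prefer a granted ARBSLOT address, else the bypass path.

    The comparison is against None because address 0 is a legitimate address.
    """
    return granted if granted is not None else bypass


def drive_banks(odd_granted, even_granted, odd_bypass, even_bypass):
    """Return the (odd bank, even bank) addresses for one cycle."""
    return (select_address(odd_granted, odd_bypass),
            select_address(even_granted, even_bypass))
```

A bank receives the bypass address only when no ARBSLOT was granted that bank's port during the cycle.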

The logic associated with each ARBSLOT 48 for sorting 
the addresses into odd and even sets and generating 
requests for the arbitrators 84a, 84b is set forth in Fig. 
4. With reference to Fig. 4, each address is stored in an 
ARBSLOT register 104. Each address includes a cache index 
101 for accessing the DCACHE 24, an odd/even (O/E) bit(s) 
102, and a plurality of bits 103 constituting the byte 
offset relative to the DCACHE 24. The byte offset 103 is 
typically ignored when the cache is accessed. The 
foregoing elements are successive in the preferred 
embodiment. 
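
As an illustrative sketch, the three successive address fields can be extracted with shifts and masks. The widths below (a 5-bit byte offset and a single O/E bit) are assumptions chosen for the example; the text does not state the actual widths.

```python
OFFSET_BITS = 5   # assumed width of the byte offset 103
OE_BITS = 1       # assumed: a single odd/even bit 102


def address_fields(addr):
    """Split an address into (cache_index, odd_even, byte_offset)."""
    byte_offset = addr & ((1 << OFFSET_BITS) - 1)
    odd_even = (addr >> OFFSET_BITS) & ((1 << OE_BITS) - 1)
    cache_index = addr >> (OFFSET_BITS + OE_BITS)
    return cache_index, odd_even, byte_offset


# Example address built from index 11, O/E = 1 (odd), byte offset 3.
example = (11 << 6) | (1 << 5) | 3
```

Under these assumptions, consecutive data lines alternate between the even and odd banks, which tends to balance the two request streams.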

Each ARBSLOT 48 examines the O/E bit 102 in the 
register 104 and receives the inverse (~DM) of a 
dependent-on-miss (DM) input 114, a cache address valid 
input (CA_VALID) 116, and a cache pending input (CP) 118 in 
order to derive an odd request 107 for arbitrator 84a, an 
even request 109 for arbitrator 84b, or neither. In terms 
of architecture, the logic of the ARBSLOT 48 includes an 
inverter 112 for producing ~DM from the DM input, an 
inverter 113 for producing ~O/E from the O/E bit 102, an 
AND logic 106 for generating an odd request 107, and an AND logic 
108 for generating an even request 109. The odd and even 
requests 107, 109 are forwarded to the respective odd and 
even arbitrators 84a, 84b (Fig. 3). 



The inputs to the AND logic 106 are the O/E bit 102, 
the signal ~DM 114', the signal CA_VALID 116 indicating 
whether or not this ARBSLOT 48 contains a valid address, 
and the signal CP 118 indicating whether or not the address 
needs to be sent to the DCACHE 24. Both signals CA_VALID 
116 and CP 118 should be asserted in order for a request 
107, 109 to be generated. The DM input 114 is asserted 
(~DM deasserted) when the ARBSLOT 48 currently needs data 
that is not in the DCACHE 24, but has already been 
requested from the main memory 16 (Fig. 1). All ARBSLOTs 
48 that are dependent on this miss data are fed with an 
asserted DM input 114 so that the corresponding ARBSLOTs 48 
refrain from requesting data from the main memory 16. As 
an example, circuitry that can be utilized to generate the 
DM input 114 is described in detail in copending 
application entitled, "Miss Tracking System And Method", 
filed the same day as the instant application, by the 
inventor herein. 

The AND logic 108, which generates the even request 
109, receives the ~O/E 102, the ~DM 114', the CA_VALID 116 
and the CP 118. When all of the foregoing signals are 
asserted, the AND logic 108 generates an even request 109 
for the even arbitrator 84b. 
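
The request logic of each ARBSLOT reduces to two four-input AND terms, which can be sketched in Python. The function name arbslot_requests and the use of Python truth values for signal levels are assumptions made for illustration.

```python
def arbslot_requests(odd_even, dm, ca_valid, cp):
    """Return (odd_request, even_request) for one ARBSLOT.

    odd request  = O/E  AND ~DM AND CA_VALID AND CP   (AND logic 106)
    even request = ~O/E AND ~DM AND CA_VALID AND CP   (AND logic 108)
    """
    not_dm = not dm                     # inverter 112
    not_odd_even = not odd_even         # inverter 113
    odd_request = bool(odd_even and not_dm and ca_valid and cp)
    even_request = bool(not_odd_even and not_dm and ca_valid and cp)
    return odd_request, even_request
```

Because O/E and ~O/E are complementary, a slot never raises both requests in the same cycle, and an asserted DM input suppresses both.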

The specific logic associated with a possible 
implementation of each of the odd and even bank arbitrators 
84a, 84b (Fig. 3) will now be described in detail with 
reference to Fig. 5. For simplicity, the logic for only 
one of the arbitrators 84a, 84b is shown in Figs. 5A-5D and 
will be described hereafter, but it should be understood 
that the logic is generally the same for the other. 

In the preferred embodiment, the arbitrator 84 of Fig. 
5A is designed to determine and launch the oldest 
instruction situated within the MQUEUE 38b. The ARBSLOTs 
48 can each provide a single request (one of REQ[27:0]) to 
each arbitrator 84. From these requests REQ[27:0], the 
arbitrator 84 (odd or even) grants only a single ARBSLOT 48 
(odd or even) the ability to launch its address during each 
cycle. In this regard, the arbitrator 84 provides 
GRANT[27:0] to the ARBSLOTs 48, respectively. 

In architecture, as shown in Fig. 5A, each arbitrator 
84 includes oldest logic 121 for determining the oldest 
group of eight requests (i.e., one of REQ[27:24], 
REQ[23:16], REQ[15:8], REQ[7:0]; note that the fourth 
group has only four, as there are only twenty eight 
ARBSLOTs 48 and instructions in the preferred embodiment). 
The oldest logic 121 receives four retire pointers 
RET[25, 17, 9, 1] and outputs four signals OLD[3:0], one 
corresponding to each group of eight requests, as indicated 
by reference arrow 122. The retire pointers RET[27:0] 
indicate where the next two instructions to retire are 
located. At any given time, two of the foregoing retire 
pointers are asserted, thereby indicating the oldest 
requests REQ[27:0]. In essence, the retire pointers 
RET[27:0] are generated from a circular shift chain with 
two latches in the chain containing an asserted variable 
("1"), each of which transitions to a deasserted variable 
("0") whenever the associated MQUEUE instruction retires. 

Low done logic 124 determines whether a first half of 
the oldest group of requests has completed launching. For 
example, assume that requests REQ[7:0] are the oldest 
group. In this scenario, the low done logic 124 determines 
whether the requests REQ[3:0] have already retired. The 
low done logic 124 outputs a single signal (LOW_DONE) for 
indicating this information, as is indicated by reference 
arrow 126, based upon the retire pointers 
RET[25,21,17,13,9,5,1] that are input to it. 

The requests are grouped in fours (i.e., 
REQ[27:24, 23:20, 19:16, 15:12, 11:8, 7:4, 3:0]) and each group is 
forwarded to OR logic. For purposes of simplicity, only the 
first two groups of four requests (REQ[7:4, 3:0]) are 
illustrated in Fig. 5A. As shown, each group of four 
requests (REQ[7:4, 3:0]), denoted by reference numerals 
131-134, 136-139, is communicated to respective OR logic 
141, 142 to generate corresponding signals REQOR[0], 
REQOR[1], denoted by reference numerals 143, 144. Hence, 
the OR operation yields REQOR[6:0] based upon REQ[27:0]. 
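
The OR reduction of the twenty eight requests into seven group signals can be sketched as follows. The list-based representation, in which req[i] stands for REQ[i] and the result at position g stands for REQOR[g], is an assumption made for illustration.

```python
NUM_SLOTS = 28    # REQ[27:0]
GROUP_SIZE = 4    # requests per OR logic


def reqor(req):
    """Return REQOR[6:0] as a list; element g is the OR of REQ[4g+3:4g]."""
    if len(req) != NUM_SLOTS:
        raise ValueError("expected one request per ARBSLOT")
    return [any(req[g * GROUP_SIZE:(g + 1) * GROUP_SIZE])
            for g in range(NUM_SLOTS // GROUP_SIZE)]


requests = [False] * NUM_SLOTS
requests[5] = True    # REQ[5] falls in group REQ[7:4], i.e. REQOR[1]
```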

Grant decision logic 146 receives the signals OLD [3:0] 
122, LOW_DONE 126, REQ[27:0], and REQOR[6:0]. Based upon 
the logic states of the foregoing signals, the grant 
decision logic 146 launches an address from one of the 
ARBSLOTs 48 by asserting one of the corresponding grant 
signals GRANT[27:0]. 

The preferred embodiment of the oldest logic is set 
forth in Fig. 5B. As shown in Fig. 5B, the oldest logic 
121 implements a circular shift chain 161 that sets a bit 
to indicate which group of eight requests is the oldest 
based upon the retire pointers RET[25, 17, 9, 1]. The chain 
161 includes a transistor 162 actuated by a retire pointer 
RET[1] 164 and connected to a master/slave latch (M/S) 166, 
which provides an oldest signal OLD[0] 168; a transistor 
172 actuated by a retire pointer RET[9] 174 and connected 
between the M/S latch 166 and a M/S latch 176, which 
provides an oldest signal OLD[1] 178; a transistor 182 
actuated by a retire pointer RET[17] 184 and connected 
between the M/S latch 176 and a M/S latch 186, which 
provides an oldest signal OLD[2] 188; and a transistor 192 
actuated by a retire pointer RET[25] 194 and connected 
between the M/S latch 186 and a M/S latch 196, which 
generates an oldest signal OLD[3] 198. Recall that the 
retire pointers (RET[1, 9, 17, 25]), denoted by 
corresponding reference numerals 164, 174, 184, 194, 
indicate where the next two instructions to retire are 
located. At any given time, one of the OLD[3:0] is 
asserted, thereby indicating the oldest set of eight 
requests. 



The preferred embodiment of the low done logic 124 is 
shown in Fig^C. With reference to Fig^^C, the low done 
logic 124 generates the signal LOW_DONE 126 based upon the 
states of retire pointers RET (29, 25, 21, 17, 13, 9, 5, 1] 
denoted by respective reference numerals 201-208. The low 
done logic 124 includes a latch 211, which receives the 
retire pointers RET [9, 5] 201, 202 at its set and clear 
(CLR) inputs, respectively, and generates em output 212 
that actuates a transistor 214 having its source 216 
connected to a wire-OR output 126. A latch 221 receives 
the retire pointers RET [17, 13] 203, 204 at its set and 
clear inputs, respectively, and produces an output 222 that 
actuates a transistor 224 having a source 226 connected to 
the wire-OR output 126. A latch 231 receives the retire 
pointers RET [25, 21] 205, 206 at its set and clear inputs, 
respectively, and produces an output 232 that actuates a 
transistor 234 having its source 236 connected to the 
wire-OR output 126. A latch 241 receives the retire 
pointers RET[29, 1] 207, 208 at its set and clear inputs, 
respectively, and produces an output 242 which actuates a 
transistor 244 having its source 246 connected to the 
wire-OR output 126. By the aforementioned arrangement, the 
low done logic 124 determines which half in the oldest 
group of eight has already retired. 
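
The role of the signal LOW_DONE can likewise be sketched in software. The Python model below is an assumption-laden illustration, not a description of Fig. 5C: it does not model the polarity of the set and clear inputs of the latches or of the wire-OR output. It assumes only that LOW_DONE reads true while the lower four requests of the oldest group of eight have retired, that RET[5], RET[13], RET[21], and RET[29] mark that condition, and that RET[1], RET[9], RET[17], and RET[25] mark the start of a new group. The names `ENTER_UPPER_HALF`, `ENTER_NEXT_GROUP`, and `step_low_done` are hypothetical.

```python
# Behavioral sketch only; latch and wire-OR polarity are NOT modeled.
# Assumption: these pointers mark the lower four of a group as retired.
ENTER_UPPER_HALF = {5, 13, 21, 29}
# Assumption: these pointers mark the start of a new group of eight.
ENTER_NEXT_GROUP = {1, 9, 17, 25}


def step_low_done(low_done, asserted_ret):
    """Return the next LOW_DONE value given the asserted retire pointer."""
    if asserted_ret in ENTER_UPPER_HALF:
        return True       # lower half of the oldest group has retired
    if asserted_ret in ENTER_NEXT_GROUP:
        return False      # a new oldest group begins; its lower half is pending
    return low_done       # intermediate pointers do not change the state


if __name__ == "__main__":
    low_done = False
    for ret in range(1, 18, 2):
        low_done = step_low_done(low_done, ret)
        print(f"RET[{ret}] -> LOW_DONE = {low_done}")
```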

It should be noted that it is not necessary to 
identify the oldest instruction, but only the oldest set of 
four instructions. This is sufficient because there is 
always a gap between the oldest instruction and the 
youngest valid instruction and because this logic does not 
have to always select the oldest. 

The preferred embodiment for implementing the grant 
decision logic 146 (Fig. 5A) is set forth in detail in Fig. 
5D. Referring to Fig. 5D, the grant decision logic 146 
includes qualify logic 252, which receives the inputs 
OLD[3:0], LOW_DONE, and REQOR[6:0]. The qualify logic 252 
implements the boolean equations and OR logic 254, 256 as 
indicated in Fig. 5D upon the aforementioned inputs to 
generate a series of qualify signals QUAL[6:0]. For 
simplicity, only the boolean equations for the first eight 
requests REQ[7:0] and their corresponding resultant qualify 
signals QUAL[1:0], as denoted by reference numerals 158, 
159, are shown in Fig. 5D. However, the pattern of boolean 
equations is repeated. The qualify signals QUAL[6:0] 
indicate which group of four requests should be focused 
upon for the next launch grant. Thus, in the case of 
requests REQ[7:0], the qualify signals QUAL[1, 0] indicate 
which four (either REQ[7:4] or REQ[3:0]) should be focused 
upon next to grant a launch. 

The grant decision logic 146 further includes a 
plurality of AND logic mechanisms, only the first eight of 
which are shown for simplicity, as designated by reference 
numerals 261-268. The first eight AND logic mechanisms 
261-268 evaluate the qualify signals QUAL[1:0], REQ[7:0], 
and ~REQ[6:0] in order to produce GRANT[7:0]. 

More specifically, the AND logic 261 receives the 
qualify signal QUAL[0] and the request REQ[0] and generates 
therefrom a grant signal GRANT[0], which determines whether 
or not the first ARBSLOT will launch. The AND logic 262 
receives the request REQ[1], ~REQ[0], and QUAL[0] and 
generates a grant signal GRANT[1], denoted by reference 
numeral 152, which is forwarded to a corresponding ARBSLOT 
48 for determining when the corresponding ARBSLOT 48 is to 
launch. The AND logic 263 receives the request REQ[2], 
~REQ[1], ~REQ[0], and the QUAL[0] and generates therefrom 
a grant signal GRANT[2], denoted by reference numeral 153, 
which is forwarded to a corresponding ARBSLOT 48 for 
determining when the corresponding ARBSLOT 48 is to launch. 
The AND logic 264 receives REQ[3], ~REQ[2], ~REQ[1], 
~REQ[0], and QUAL[0] and generates therefrom a grant signal 
GRANT[3], denoted by reference numeral 154, which is 
forwarded to a corresponding ARBSLOT 48 for determining 
when the corresponding ARBSLOT 48 is to launch. The AND 
logic 265 receives REQ[4] and QUAL[1] and determines 
therefrom a grant signal GRANT[4], denoted by reference 
numeral 155, which is forwarded to a corresponding ARBSLOT 
48 to determine when the corresponding ARBSLOT 48 is to 
launch. The AND logic 266 receives REQ[5], ~REQ[4], and 
QUAL[1] and determines therefrom a grant signal GRANT[5], 
denoted by reference numeral 156, which is forwarded to a 
corresponding ARBSLOT 48 to determine when the 
corresponding ARBSLOT 48 is to launch. The AND logic 267 
receives REQ[6], ~REQ[5], ~REQ[4], and QUAL[1] and 
generates therefrom a grant signal GRANT[6], denoted by 
reference numeral 157, which is forwarded to a 
corresponding ARBSLOT 48 for determining when the 
corresponding ARBSLOT 48 is to launch. The AND logic 268 
receives the REQ[7], ~REQ[6], ~REQ[5], ~REQ[4], and the 
QUAL[1] and generates therefrom a grant signal GRANT[7], 
denoted by reference numeral 158, which is forwarded to a 
corresponding ARBSLOT 48 for determining when the 
corresponding ARBSLOT 48 is to launch its address. 
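
The AND terms recited above amount to a fixed-priority selection within each group of four requests, gated by the qualify signal for that group. The following Python sketch restates those terms for the first eight requests. It is an illustrative software model rather than the circuit of Fig. 5D, and the function names `grants_for_group` and `grant_decision` are hypothetical. Bits are passed as lists with index 0 first.

```python
def grants_for_group(req, qual):
    """Compute four grant bits for one group of four requests.

    req  -- list of four 0/1 request bits, lowest index first
    qual -- 0/1 qualify bit for this group
    Each grant is: qual AND this request AND NOT every lower request
    in the same group, as in AND logic 261-264 and 265-268.
    """
    grants = []
    blocked = 0  # becomes 1 once a lower-indexed request in the group is set
    for r in req:
        grants.append(qual & r & (1 - blocked))
        blocked |= r
    return grants


def grant_decision(req8, qual2):
    """Return GRANT[0..7] from REQ[0..7] and QUAL[0..1] (index 0 first)."""
    return (grants_for_group(req8[0:4], qual2[0])
            + grants_for_group(req8[4:8], qual2[1]))


if __name__ == "__main__":
    req = [0, 1, 1, 0, 1, 0, 0, 1]
    print(grant_decision(req, [1, 0]))  # lower group qualified
    print(grant_decision(req, [0, 1]))  # upper group qualified
```

With the lower group qualified, only the lowest-numbered pending request among REQ[3:0] receives a grant; with the upper group qualified, the same holds among REQ[7:4].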

Many variations and modifications may be made to the 
preferred embodiment of the invention as described 
previously. As an example, the queues 38a, 38b in the 
processor 14 could be replaced by any suitable instruction 
reordering mechanism, including a reservation station. All 
such modifications and variations are intended to be 
included herein within the scope of the present invention, 
as is defined by the following claims. In the claims 
hereafter, the structures, materials, acts, and equivalents 
of all means-plus-function elements and all 
step-plus-function elements are intended to include any and 
all structures, materials, or acts for performing the 
specified functions. 

1. A system for a computer that executes instructions out of order, comprising: 

a data cache having an odd bank and an even bank; 

a processor configured to concurrently forward addresses to corresponding 
cache banks during a single processor cycle; said processor comprising: 

an instruction cache; 

an instruction fetch mechanism configured to retrieve instructions from said 
instruction cache; 

a sort mechanism configured to receive instructions from said instruction 
fetch mechanism and configured to sort said instructions into arithmetic 
instructions and memory instructions; 

a queue configured to receive said memory instructions from said sort 
mechanism, said queue having: 

a plurality of address reorder buffer slots, each said address reorder buffer 
slot configured to maintain an address, to determine whether said address is odd 
and to generate a respective odd or even request depending upon whether said 
address is odd or even; and 

a bank arbitration mechanism configured to receive said odd and even 
requests and to control said address reorder buffer slots to output odd and even 
addresses, respectively, to said data cache. 

2. The system of claim 1, further comprising a means associated with said 
processor for executing instructions out of order and for receiving said addresses 
pursuant to said executing instructions out of order. 

3. The system of claim 1, wherein said data cache comprises a plurality of 
single ported random access memories. 

4. The system of claim 1, wherein said bank arbitration mechanism includes 
odd and even bank arbitrators that are configured to determine which of said odd 
and even addresses, respectively, are earliest to be received and configured to 
cause the earliest odd and even addresses to be forwarded together to said data 
cache. 



5. The system for enhancing the performance of a computer that executes 
instructions out of order, the system comprising: 

(a) a data cache having an odd bank and an even bank; 

(b) a processor having: 

(1) an instruction cache means for caching instructions; 

(2) an instruction fetch means for retrieving instructions from said 
instruction cache means; 

(3) a sort means for receiving instructions from said instruction 
fetch means and for sorting said instructions into arithmetic instructions and 
memory instructions; 

(4) a queue means for receiving said memory instructions from 
said sort means, said queue means having: 

(i) a plurality of address reorder buffer slots, each said 
address reorder buffer slot configured to maintain an address, to determine 
whether said address is odd, and to generate a respective odd or even request 
depending upon whether said address is odd or even; and 

(ii) a bank arbitration means for receiving said odd and 
even requests respectively and for controlling said address reorder buffer slots to 
output one of said addresses to each of said banks, respectively, of said data 
cache during a single processor cycle. 

6. The system of claim 5, wherein said odd and even addresses are forwarded 
to respective ports of respective single ported storage devices associated 
respectively with said banks during said single processor cycle. 